m6Aminer: Predicting the m6Am Sites on mRNA by Fusing Multiple Sequence-Derived Features into a CatBoost-Based Classifier

As one of the most important post-transcriptional modifications, m6Am plays a fairly important role in conferring mRNA stability and in the progression of cancers. The accurate identification of the m6Am sites is critical for explaining its biological significance and developing its application in the medical field. However, conventional experimental approaches are time-consuming and expensive, making them unsuitable for the large-scale identification of the m6Am sites. To address this challenge, we exploit a CatBoost-based method, m6Aminer, to identify the m6Am sites on mRNA. For feature extraction, nine different feature-encoding schemes (pseudo electron–ion interaction potential, hash decimal conversion method, dinucleotide binary encoding, nucleotide chemical properties, pseudo k-tuple composition, dinucleotide numerical mapping, K monomeric units, series correlation pseudo trinucleotide composition, and K-spaced nucleotide pair frequency) were utilized to form the initial feature space. To obtain the optimized feature subset, the ExtraTreesClassifier algorithm was adopted to perform feature importance ranking, and the top 300 features were selected as the optimal feature subset. With different performance assessment methods, 10-fold cross-validation and independent test, m6Aminer achieved average AUC of 0.913 and 0.754, demonstrating a competitive performance with the state-of-the-art models m6AmPred (0.905 and 0.735) and DLm6Am (0.897 and 0.730). The prediction model developed in this study can be used to identify the m6Am sites in the whole transcriptome, laying a foundation for the functional research of m6Am.


Introduction
In recent years, post-transcriptional modifications of RNA have become an important focus of biological research. Since the discovery of the first RNA modification [1], more than 170 post-transcriptional modifications have been discovered on almost all types of RNA [2], including mRNA, tRNA, rRNA, and so on [3]. RNA modifications have demonstrated a high specificity and efficiency in regulating biological functions, such as their involvement in RNA translation [4], embryonic stem cell development [5], cancer cell survival, and migration [6]. Moreover, studies have shown that mutations in many RNA-modifying enzymes are involved in the development of human diseases, such as motor neuron disorders and autosomal-recessive intellectual disability (ARID) [7].
Among the numerous RNA modifications, a reversible modification called N6,2 -Odimethyladenosine (m6Am) exists on mRNA and lncRNA in higher eukaryotes [8,9]. Since the first nucleotide after the cap is adenosine, it can be methylated on the 2 -hydroxyl

Model Performance with Different Machine Learning Algorithms and Feature Subsets
To find the appropriate machine learning algorithm, five different classifiers, Ada-Boost, random forest (RF), K nearest neighbors (KNN), support vector machine (SVM), and CatBoost, were implemented and trained on the 10 sub-training datasets. For feature extraction, 1120 features were extracted using the above-mentioned 9 feature encoding schemes on the 10 sub-training datasets. For model training, 10-fold cross-validation was used on each sub-training dataset.
As shown in Table 1, the CatBoost-based model achieved the best performance, with an average ACC of 0.834 ± 0.003, an AUC of 0.912 ± 0.002, an Sn of 0.805 ± 0.004, an Sp of 0.863 ± 0.004, an F1 of 0.829 ± 0.003, and an MCC of 0.669 ± 0.006. Compared with the other four classifiers, the CatBoost-based model successfully increased the ACC by over 0.7%, the AUC by over 1%, the Sn by over 0.7%, the F1 by over 0.8%, and the MCC by over 1.2%. Although the SVM Sp value was higher than the CatBoost Sp value, the other metrics were significantly lower than the CatBoost metrics ( Figure 2). Simultaneously, for the RF-based model, the standard deviations (SDs) of six metrics except for Sp were the lowest, indicating that RF has the least volatility on different data sets. However, the six metrics of RF were significantly lower than those of CatBoost. Therefore, the m6Aminer utilized the CatBoost algorithm to construct the model.

Model Performance with Different Machine Learning Algorithms and Feature Subsets
To find the appropriate machine learning algorithm, five different classifiers, AdaBoost, random forest (RF), K nearest neighbors (KNN), support vector machine (SVM), and CatBoost, were implemented and trained on the 10 sub-training datasets. For feature extraction, 1120 features were extracted using the above-mentioned 9 feature encoding schemes on the 10 sub-training datasets. For model training, 10-fold cross-validation was used on each sub-training dataset.
As shown in Table 1, the CatBoost-based model achieved the best performance, with an average ACC of 0.834 ± 0.003, an AUC of 0.912 ± 0.002, an Sn of 0.805 ± 0.004, an Sp of 0.863 ± 0.004, an F1 of 0.829 ± 0.003, and an MCC of 0.669 ± 0.006. Compared with the other four classifiers, the CatBoost-based model successfully increased the ACC by over 0.7%, the AUC by over 1%, the Sn by over 0.7%, the F1 by over 0.8%, and the MCC by over 1.2%. Although the Sp value of SVM was higher than that of CatBoost, the other metrics were significantly lower than the CatBoost metrics ( Figure 2). Simultaneously, for the RF-based model, the standard deviations (SDs) of six metrics except for Sp were the lowest, indicating that RF has the least volatility on different data sets. However, the six metrics of RF were significantly lower than those of CatBoost. Therefore, the m6Aminer utilized the CatBoost algorithm to construct the model.
To compare its performance with different feature subsets, the CatBoost-based model was trained on the 10 sub-training datasets with different feature subsets. The hyperparameters of the CatBoost classifier were set to their default parameters. Using different feature-encoding schemes, the mean AUC values of our models were all above 0.870 (Table 2). Among the models, PseEIIP achieved the highest AUC of 0.910 ± 0.002 on the 10 sub-training datasets when adopting a single feature extraction method. However, the other schemes also showed decent performance on the 10 sub-training datasets. As shown in Table 2, the performance achieved when combining multiple feature encoding schemes for our model was better than when a single method was utilized.   To compare its performance with different feature subsets, the CatBoost-based model was trained on the 10 sub-training datasets with different feature subsets. The hyperparameters of the CatBoost classifier were set to their default parameters. Using different feature-encoding schemes, the mean AUC values of our models were all above 0.870 (Table 2). Among the models, PseEIIP achieved the highest AUC of 0.910 ± 0.002 on the 10 sub-training datasets when adopting a single feature extraction method. However, the other schemes also showed decent performance on the 10 sub-training datasets. As shown

Feature Ranking and Selection
The ExtraTreesClassifier algorithm is an ensemble classifier utilized for feature ranking. The importance scores of features also can be obtained to analyze the performance of the nine feature extraction methods (Supplementary Data S1). It can be seen that DNM, K-mer, Ksnpf, PseEIIP, and SCPseTNC showed a high correlation with the classification results ( Figure 3A), indicating that these features play a fairly important role in this classification task. The importance of the top 20 important features ( Figure 3B) also shows that the features extracted via DNM, K-mer, Ksnpf, PseEIIP, and SCPseTNC have a greater impact on model performance. It can be distinctly seen that the proportions of K-mer, PseEIIP, and SCPseTNC in the top 20 features are 30% (6/20), 25% (5/20), and 20% (4/20), respectively. These results confirm that there is a strong correlation between the electronic properties and the physical and chemical properties of the corresponding nucleotide and the accurate recognition of the m6Am sites.

Model Optimization
After feature selection, GridSearchCV was utilized to find the best hyperparameters, e.g., "iterations", "depth", and "learning_rate". In this study, "iterations" was selected from {250, 100, 500, 1000, 1500, 2000}, "depth" was selected from 1 to 10, and "learn- The equidistance method was then used to select the key features. The top 50 features were used for the first iteration, and the second top 50 ranked features were then added to the feature subset in subsequent iterations, such as the top 50 features for the first iteration and the top 100 features for the second iteration, until the 1120 features were utilized to train the model (using 10-fold cross-validation on the 10 sub-training datasets). As shown in Figure 3C, our model obtained the best performance using the top 300 important features for feature encoding, and our model received an average AUC of 0.912 ± 0.002, ACC of 0.832 ± 0.002, Sn of 0.804 ± 0.003, Sp of 0.861 ± 0.005, F1 of 0.827 ± 0.002, MCC of 0.666 ± 0.005, respectively (the detailed results are listed in Supplementary Table S1).

Comparison with the State-of-the-Art Predictors
To further evaluate the performance of our model, m6Aminer was compared with the state-of-the-art predictors m6AmPred [25] and DLm6Am [26]. In m6AmPred, PseEIIP was used for feature extraction, and extreme gradient boosting with the dart algorithm (XgbDart) was used for model training. The grid research method was utilized, and m6AmPred received the best gamma = 1.05, learning_rate = 0.1, and eta = 0.03. Other parameter settings refer to the description by Jiang et al. [25]. In DLm6Am, one-hot, ND, and NCP were utilized for extraction, and the deep learning algorithm was used to train the model. The detail parameter settings were set as in Luo et al. [26]. These two models were both trained and optimized on our dataset with the same fold to ensure the fairness of comparison.
As shown in Table 4, our model achieved an average ACC of 0.834 ± 0.003, AUC of 0.913 ± 0.002, Sn of 0.806 ± 0.003, Sp of 0.861 ± 0.005, F1 of 0.829 ± 0.003, MCC of 0.668 ± 0.006 on the 10 sub-training datasets using 10-fold cross-validation. The six metrics of our model were higher than those of the other two models, and m6Aminer successfully increased the average ACC by over 0.7%, AUC by over 0.8%, Sn by over 0.3%, Sp by over 0.3%, F1 by over 0.7%, and MCC by over 1.3%, demonstrating that our predictor achieved better performance in identifying m6Am than the state-of-the-art predictors m6AmPred and DLm6Am. Simultaneously, the SD of the average AUC values for m6Aminer was 0.002, showing that the stability of our model is comparable with the state-of-the-art models. The mean is an average of the 10-fold cross-validation results of each predictor on the 10 sub-training datasets, while the SD is their standard deviation.
As shown in Table 5, our model received an average AUC of 0.754 ± 0.005, ACC of 0.647 ± 0.009, Sn of 0.874 ± 0.006, Sp of 0.420 ± 0.024, F1 of 0.713 ± 0.004, and MCC of 0.331 ± 0.015 on the independent testing dataset, which increased the average AUC by over 1.9%, ACC by over 0.5%, Sp by over 1.1%, F1 by over 0.3%, and MCC by over 0.9% (the details of the independent testing results are listed in Supplementary Table S3). Except for Sn, the other measurements were higher than those of m6AmPred and DLm6Am. Concurrently, the SDs of the six metrics for m6Aminer were lower than that of the other two models, proving that m6Aminer has the best stability on different datasets. This best stability indicates that our model achieves stronger competitiveness in terms of its generalization ability when compared to the other two models with a completely new dataset. To more intuitively compare the effectiveness of the three predictors, the ROC curves and precision-recall curves are also shown in Figures 4 and 5, respectively. The mean is an average of the independent testing results of each predictor on the 10 sub-training datasets, while the SD is their standard deviation.

Web Server
To save the experimental cost of m6Am identification, simplify the experimental process, and reduce the experimental time, a web server based on the flask framework was designed to predict the m6Am sites. It can be accessed at www.m6aminer.cn (accessed on 3 April 2023). It allows users to upload the RNA sequences for identification by submitting the fasta format sequences through the text box. As the length of the RNA sequence used for the training model was 41, the RNA sequence input by users should also be 41nt. The online tool can be used to predict the m6Am sites in real time. The Tencent cloud server will automatically process the data and then feed the predicted results back to the result webpage for the user's reference. The specific page of the webpage is shown in Figure 6. The source code for m6Aminer is also provided in the download section, which allows users to run the model locally.    Comparison of the precision-recall curves for three models on the independent testing dataset. Figure 5. Comparison of the precision-recall curves for three models on the independent testing datasets. and submit them. As the length of the RNA sequence used for the training model was 41, the RNA sequence input by users should also be 41nt. The online tool can be used to predict the m6Am sites in real time. The Tencent cloud server will automatically process the data and then feed the predicted results back to the result webpage for the user's reference. The specific page of the webpage is shown in Figure 6. The source code for m6Aminer is provided on the website in the download section, which allows users to run the model locally. Figure 6. The m6Aminer web server.

Discussions
Accurately predicting the existence of m6Am on RNA is particularly important for revealing the biological function of m6Am more deeply. In this study, we developed a new computational model, m6Aminer, to improve the prediction of m6Am. The model innovatively utilized nine feature schemes (PseEIIP, DBE, Hash, NCP, PseKNC, DNM, Kmer, SCPseTNC, and Ksnpf), and selected the top 300 features using the ExtraTreesClassifier algorithm. It was found that the features extracted by K-mer, PseEIIP, and SCP-seTNC accounted for the majority of the top 20 features, suggesting that the electronic, physical, and chemical properties of nucleotides have an important impact on the classification task. To the best of our knowledge, the currently available models for predicting m6Am using machine learning are m6AmPred and DLm6Am. Compared to the above models, m6Aminer excavated more important features related to the m6Am site. Additionally, our model does not consume GPU resources and requires little training time. According to the comprehensive comparison results of this study, it can be seen that our predictor achieves competitive performance when compared with m6AmPred and DLm6Am. Therefore, our method can be used as an efficient tool for identifying m6Am or can at least become an auxiliary means of identifying m6Am in the future.

Discussions
Accurately predicting the existence of m6Am on RNA is particularly important for revealing the biological function of m6Am more deeply. In this study, we developed a new computational model, m6Aminer, to improve the prediction of m6Am. The model innovatively utilized nine feature schemes (PseEIIP, DBE, Hash, NCP, PseKNC, DNM, Kmer, SCPseTNC, and Ksnpf), and selected the top 300 features using the ExtraTreesClassifier algorithm. It was found that the features extracted by K-mer, PseEIIP, and SCPseTNC accounted for the majority of the top 20 features, suggesting that the electronic, physical, and chemical properties of nucleotides have an important impact on the classification task. To the best of our knowledge, the currently available models for predicting m6Am using machine learning are m6AmPred and DLm6Am. Compared to the above models, m6Aminer excavated more important features related to the m6Am site. Additionally, our model does not consume GPU resources and requires little training time. According to the comprehensive comparison results of this study, it can be seen that our predictor achieves competitive performance when compared with m6AmPred and DLm6Am. Therefore, our method can be used as an efficient tool for identifying m6Am or can at least become an auxiliary means of identifying m6Am in the future.
However, our method has some limitations. Firstly, the limited experimental data have a great impact on the generalization ability of the model. In the future, more sequences containing the m6Am sites will be collected to enhance the recognition accuracy for the m6Am sites. Furthermore, this study only used sequence features. In future studies, RNA secondary structure features or genomic features could also be used to extract more useful features for identifying the m6Am sites. Finally, some more advanced algorithms should be used to optimize our model, such as transfer learning.

Benchmark Dataset
In this study, benchmark datasets were downloaded from http://47.94.248.117/DLm6 Am (accessed on 3 April 2023) and https://www.xjtlu.edu.cn/biologicalsciences/m6am (accessed on 3 April 2023). To remove sequence redundancy, the benchmark datasets were screened with CD-HIT, using the most rigorous threshold of 0.8 [28]. A total of 3700 positive samples and 37,000 negative samples were assigned to the training dataset after the above operation, while the independent testing dataset consisted of 320 positive samples and 320 negative samples. To construct an approximate 1:1 positive-to-negative ratio dataset, 37,000 negative samples were split into 10 negative datasets. Ten sub-training datasets were generated by combining each of these then negative datasets with a positive dataset. Their prediction performances were averaged during the evaluation to reduce batch variance [25]. Furthermore, the 10 sub-training datasets were used for feature selection and model construction, while the independent testing datasets only participated in the independent test.

Pseudo Electron-Ion Interaction Potential (PseEIIP)
A feature extraction method using the electron-ion interaction potential (EIIP) values of nucleotides was pioneered by Nair and Sreenadhan [29]. The EIIP value is calculated from the energy of the delocalized electrons in amino acids or nucleotides. This method was initially used to map exons and has been widely used to build models to identify some modifications, such as m5C [30][31][32].
In PseEIIP, the extracted features are divided into two parts: the distribution of freeelectron energy along the RNA sequence and the product of three consecutive nucleotides' free-electron energies and frequencies. In the portion of the free-electron energy distributed along the RNA sequence, the nucleotide of each RNA sequence is encoded as a numerical value representing its electron-ion interaction potential. The values corresponding to each nucleotide are shown in Table 6. In the second part, the three consecutive nucleotides' free-electron energies and their frequencies of occurrence are calculated separately. The freeelectron energy of any three continuous nucleotides is compiled by that of three individual ones, and the frequency is the current tri-consecutive nucleotide's total number divided by those occurrence times. The formula is as follows: EI IP xyz = EI IP x + EI IP y + EI IP z (1) Among these variables, N xyz represents the number of occurrences of three consecutive nucleotides in a sequence, and L represents the sequence length. Therefore, for any given RNA sequence, a feature vector of L + 64 dimensions can be obtained by using the PseEIIP method. In this study, the sequence length of each sample was 41nt, and a total of 105 features were extracted from an RNA sequence. The feature vectors are as follows: PseEI IP = [EI IP AAA f AAA , EI IP AAC f AAC , · · · · · · , EI IP TTT f TTT ]

Hash Decimal Conversion Method (Hash)
Hash [33] is a method of converting quaternary numbers into decimals, which represent the four bases in RNA sequences (A, C, G, and U). For an RNA sequence, the contiguous two bases can be considered a two-digit-quaternary number. Therefore, the sample of 41nt can be turned into 40 two-digit quaternary numbers. The transformation criteria for four bases when using hash are 'A'-0, 'G'-1, 'C'-2, and 'U'-3. Subsequently, each two-digit quaternary number can be converted into a decimal via the formula: Supposing that the length of each sample is m and the length of the selected bases is k, the number of the final decimal number obtained is m − k + 1. For the dataset used in the study, a 40-dimension feature vector was generated for each sample.

Nucleotide Chemical Property (NCP)
NCP was first proposed by Bari et al. to predict the splice sites in pre-mRNA [37]. Since the four nucleotides in the RNA sequence may provide different functional structures according to their chemical structures, three coordinate values were used to represent the different chemical properties of the four nucleotides. Firstly, both purines (A and G) and pyrimidines (C and U) have rings; purines have two rings, and pyrimidines have only one ring. Therefore, we used the x-coordinate information to distinguish the ring structures of purines and pyrimidines. Secondly, through the y-coordinate information, we judged the existence of amino groups (A and C) or keto groups (G and U). Finally, the z-coordinate information was determined via strong hydrogen bonds (C and G) and weak hydrogen bonds (A and U). According to the different chemical properties of the above four nucleotides, the encoding of a nucleotide in an RNA sequence satisfies the following formula: In detail, the four nucleotides A, C, G, and U were encoded as vectors (1,1,1), (0,1,0), (1,0,0), and (0,0,1), respectively. Therefore, each nucleotide in an RNA sequence was present as three elements, and a total of 123 features were extracted for each sample in the study.

Pseudo k-Tuple Composition (PseKNC)
PseKNC has been successfully applied to the identification of RNA or DNA modification by forming the physicochemical properties of oligonucleotides [38], such as MSLP [39], iterb-PPse [40], and mRNALocater [41]. Each sample can be expressed utilizing the following formula: where: where f n is the normalized frequency of 16 dinucleotides in the RNA sequences, θ j is the j-tier sequence correlation factor, and λ is the top count level of correlation in the samples.

Dinucleotide Numerical Mapping (DNM)
DNM is a feature extraction method proposed by Zu et al. which can be used to obtain a 48-dimensional feature vector based on the average, expectation, and variance of the dinucleotide [42]. For each sequence, 16 dinucleotides can be extracted by using the four bases (A, G, C, and U). For the AA dinucleotide, the detailed solution procedure is shown as follows: where m is the length of the RNA sequence and was set to 41 in this study, and f AA (t i ) is the criterion for judging whether the dinucleotide is AA or not. For a sequence, if the two adjacent bases are AA, f AA (t i ) = 1; otherwise f AA (t i ) = 0. The statistics of average, expectation, and variance are named n AA , u AA and D AA 2 and can be calculated using the following formula: The average, expectation, and variance of the other 15 dinucleotides (AC, AG, AT, . . . , TG, TT) are also obtained through the above calculation method.

K Monomeric Units (K-mer)
As a simple and effective feature extraction method, K-mer has been widely used in a variety of prediction models [43,44]. The feature extraction principle of K-mer is to count the number of occurrences of k consecutive nucleotides in the RNA sequence. If k is set to three, the number of occurrences of 64 consecutive trinucleotides is calculated, including AAA, AAT, . . . , GGG. To calculate the K-mer of each sample, the counting range of a sequence of length L is obtained from the initial nucleotide until the L − k + 1 th nucleotide. In this study, the values of k were set to 2, 3, and 4, respectively. Therefore, a total of 336 sequence features were extracted via K-mer.

Series Correlation Pseudo Trinucleotide Composition (SCPseTNC)
SCPseTNC was proposed in a previous study by Chen et al. [45]. In the sequencerelated pseudo-trinucleotide composition feature extraction method, the expression of RNA-seq feature vectors combines 12 built-in trinucleotide physicochemical indicators.
For a given RNA sequence, the feature vector extracted by SCPseTNC is defined as follows: where: f k is the normalized frequency of the occurrence of trinucleotides in each sample; λ is an integer, representing the highest count rank or layer associated along the RNA sequence; ω is a weighting factor from 0 to 1; Λ is the number of physicochemical properties; θ j is called the j-layer correlation factor, which reflects the sequence order correlation between all the most adjacent trinucleotides in an RNA sequence and is defined as follows: where: J ξ i,i+m = P u (R i R i+1 R i+2 )P u (R i+m R i+m+1 R i+m+2 ) ξ = 1, 2, · · · , Λ m = 1, 2, · · · λ i = 1, 2, · · · , L − λ − 3 (15) µ represents the number of physical and chemical properties; P u (R i R i+1 R i+2 ) represents the numerical value of the physical and chemical properties of the trinucleotide at a certain position. In this study, a total of 64 sequence features were extracted by using SCPseTNC.

K-Spaced Nucleotide Pair Frequency (Ksnpf)
Ksnpf is an effective extraction method and has been successfully applied to the prediction of DNA N4-methylcytosine sites [46] and RNA 5-methylcytosine (m5C) sites [47]. The formula to calculate Ksnpf is shown as follows: where L is the length of each sample; n 1 and n 2 are the nucleotides, one of A, G, C, and U; Gap(k) represents k arbitrary elements in the interval; S(n1Gap (k) n2) represents the number of occurrences of element pairs. In this study, the value range of k was set from 1 to 5. Thus, an 80-dimensional (4 × 4 × 5) vector was obtained by utilizing Ksnpf.

CatBoost
CatBoost is an extension framework of gradient boosting (GB), which has been successfully applied in various fields such as driving style recognition [48] and diabetes prediction [49]. It was developed by Prokhorenkova et al. in 2017 [50,51]. This algorithm can guarantee that all features can be trained using the manner of ranking promotion and effectively reduce the risk of overfitting [52]. CatBoost utilizes the following strategies to solve the problem of overfitting.
First, the supposition is that D is the dataset provided in this study: where n is the number of samples (i = 1,2, . . . ,n), X i = (x 1 i , x 2 i , . . . , x m i ) is a feature vector of a sample, x m i is the mth feature vector of the sample, and Y i is the target value of the sample. The feature value can be converted via Formula (18): where α is the a priori weight, ϕ is the indicator function, and p is the prior value.

Performance Evaluation
In this study, 10-fold cross-validation was used to evaluate the performance of m6Aminer. Every sub-training dataset was split into 10 roughly equal parts wherein 9 folds were regarded as a training dataset, and the remaining fold was considered a validation dataset for a 10-fold cross-validation. This process was repeated until all folds were selected as the validation dataset. Finally, the average of 10 validation results was used for the performance evaluation. Additionally, six widely used metrics in binary classification were used to evaluate the performance of our model, including accuracy (ACC), sensitivity where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative results, respectively. ACC is the most basic evaluation metric, which describes whether the prediction of the overall results is correct. The ROC curve is drawn according to the dynamic thresholds, with the true positive rate (sensitivity) as the ordinate and the false positive rate (1-specificity) as the abscissa. The AUC is the area under the ROC curve. As a frequently used metric, the AUC can reliably eliminate the impact of sample category imbalance on the indicator results. Normally, the AUC value is between 0.5 and 1. If the AUC is equal to 1, the classifier is proven to be a perfect classifier. On the contrary, if the AUC is equal to 0.5, the classifier is a random classifier, indicating that the model does not have predictive significance. Substantially, the higher the AUC value is, the better the model performs.